{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# 使用 DeepSeek R1 和 Ollama 实现本地 RAG 应用",
   "id": "20789d396d1ac375"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## 文档加载",
   "id": "73b4267dedd40548"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "加载并分割 PDF 文档，这里以 DeepSeek_R1.pdf 为例。",
   "id": "313b32846e6c5241"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-01-31T11:07:03.535329Z",
     "start_time": "2025-01-31T11:07:01.194044Z"
    }
   },
   "cell_type": "code",
   "source": [
    "from langchain_community.document_loaders import PDFPlumberLoader\n",
    "\n",
    "file = \"DeepSeek_R1.pdf\"\n",
    "\n",
    "# Load the PDF\n",
    "loader = PDFPlumberLoader(file)\n",
    "docs = loader.load()\n",
    "\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
    "all_splits = text_splitter.split_documents(docs)"
   ],
   "id": "f6259c33e03b7e86",
   "outputs": [],
   "execution_count": 1
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "接着，初始化向量存储。 我们使用的文本嵌入模型是 `nomic-embed-text` 。",
   "id": "b314596ac84e5c54"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-01-31T11:07:06.695928Z",
     "start_time": "2025-01-31T11:07:03.536580Z"
    }
   },
   "cell_type": "code",
   "source": [
    "from langchain_chroma import Chroma\n",
    "from langchain_ollama import OllamaEmbeddings\n",
    "\n",
    "local_embeddings = OllamaEmbeddings(model=\"nomic-embed-text\")\n",
    "\n",
    "vectorstore = Chroma.from_documents(documents=all_splits, embedding=local_embeddings)"
   ],
   "id": "85f7eadabdab1251",
   "outputs": [],
   "execution_count": 2
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "现在，我们得到了一个向量存储，可以用来进行相似度搜索。",
   "id": "ea0b8d90991038c4"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-01-31T11:07:06.742131Z",
     "start_time": "2025-01-31T11:07:06.696941Z"
    }
   },
   "cell_type": "code",
   "source": [
    "question = \"What is the purpose of the DeepSeek project?\"\n",
    "docs = vectorstore.similarity_search(question)\n",
    "for doc in docs:\n",
    "    print(doc.page_content)"
   ],
   "id": "28989e64549cb75e",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "engineeringtasks. Asaresult,DeepSeek-R1hasnotdemonstratedahugeimprovement\n",
      "over DeepSeek-V3 on software engineering benchmarks. Future versions will address\n",
      "thisbyimplementingrejectionsamplingonsoftwareengineeringdataorincorporating\n",
      "asynchronousevaluationsduringtheRLprocesstoimproveefficiency.\n",
      "16\n",
      "DeepSeek-R1avoidsintroducinglengthbiasduringGPT-basedevaluations,furthersolidifying\n",
      "itsrobustnessacrossmultipletasks.\n",
      "On math tasks, DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217,\n",
      "surpassingothermodelsbyalargemargin. Asimilartrendisobservedoncodingalgorithm\n",
      "tasks,suchasLiveCodeBenchandCodeforces,wherereasoning-focusedmodelsdominatethese\n",
      "benchmarks. Onengineering-orientedcodingtasks,OpenAI-o1-1217outperformsDeepSeek-R1\n",
      "first open research to validate that reasoning capabilities of LLMs can be incentivized\n",
      "purelythroughRL,withouttheneedforSFT.Thisbreakthroughpavesthewayforfuture\n",
      "advancementsinthisarea.\n",
      "• We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL\n",
      "stagesaimedatdiscoveringimprovedreasoningpatternsandaligningwithhumanpref-\n",
      "erences, as well as two SFT stages that serve as the seed for the model’s reasoning and\n",
      "DeepSeek-R1,whichincorporatesasmallamountofcold-startdataandamulti-stagetraining\n",
      "pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the\n",
      "DeepSeek-V3-Basemodel. Followingthis,weperformreasoning-orientedRLlikeDeepSeek-R1-\n",
      "Zero. UponnearingconvergenceintheRLprocess,wecreatenewSFTdatathroughrejection\n",
      "samplingontheRLcheckpoint,combinedwithsuperviseddatafromDeepSeek-V3indomains\n",
      "suchaswriting,factualQA,andself-cognition,andthenretraintheDeepSeek-V3-Basemodel.\n"
     ]
    }
   ],
   "execution_count": 3
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## 构建 Chain 表达式",
   "id": "e6557e1bbbb43c0b"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "我们将使用 `langchain` 库整合文档、prompt 模板和输出解析器，来构建一个 Chain 表达式，以便在 DeepSeek R1 和 Ollama 之间进行交互。",
   "id": "8451efd6c8338fb8"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-01-31T11:07:12.891293Z",
     "start_time": "2025-01-31T11:07:06.743135Z"
    }
   },
   "cell_type": "code",
   "source": [
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_ollama import ChatOllama\n",
    "\n",
    "model = ChatOllama(\n",
    "    model=\"deepseek-r1:1.5b\",\n",
    ")\n",
    "\n",
    "prompt = ChatPromptTemplate.from_template(\n",
    "    \"Summarize the main themes in these retrieved docs: {docs}\"\n",
    ")\n",
    "\n",
    "# 将传入的文档转换成字符串的形式\n",
    "def format_docs(docs):\n",
    "    return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
    "\n",
    "\n",
    "chain = {\"docs\": format_docs} | prompt | model | StrOutputParser()\n",
    "\n",
    "question = \"What is the purpose of the DeepSeek project?\"\n",
    "\n",
    "docs = vectorstore.similarity_search(question)\n",
    "\n",
    "chain.invoke(docs)"
   ],
   "id": "255c7c6f56cafd3d",
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"<think>\\nOkay, I need to summarize the main themes from these retrieved documents about DeepSeek-R1 improving in math and coding tasks over previous models. Let me start by reading through each point carefully.\\n\\nFirst document: It talks about DeepSeek-R1 not showing significant improvement over V3 on software engineering benchmarks. Future versions will address this by implementing rejections sampling on data or incorporating asynchronous evaluations to improve efficiency.\\n\\nSecond document: DeepSeek-R1 avoids length bias in GPT-based evaluations and outperforms other models on math tasks, surpassing OpenAI's o1-1217 by a large margin. They also observed similar performance on coding algorithm tasks where reasoning-focused models dominate.\\n\\nThird document introduces their pipeline for DeepSeek-R1, which uses two RL stages: one for discovering improved patterns and aligning with human preferences, and another for SFT stages using cold-start data and a multi-stage pipeline. They fine-tune V3-Basely first, then use a reasoning-oriented RL like R1-Zero.\\n\\nNow, putting this together:\\n\\nDeepSeek-R1 starts by improving in math but not much in software engineering. To make better across multiple tasks, they might add rejections sampling or async evaluations.\\n\\nThey have a pipeline for their model with two RL phases and SFTs to seed reasoning. They use cold start data, fine-tune V3-Basely first, then R1-Zero.\\n\\nSo the main themes are:\\n\\n1. Software Engineering improvements in math tasks.\\n2. Future enhancements needed for software engineering by adding rejections or async evaluations.\\n3. Pipeline aspects of DeepSeek-R1: two RL stages and SFT integration.\\n4. Fine-tuning process using V3-Basely first, then R1-Zero.\\n5. Performance trends where reasoning models outperform others in certain tasks.\\n\\nI think these cover the main points from each document.\\n</think>\\n\\n**Summary of Key Themes:**\\n\\n1. **Software Engineering Improvements**: DeepSeek-R1 demonstrates gains in math tasks but not significantly in software engineering benchmarks, suggesting areas for future enhancement through rejections sampling or asynchronous evaluations.\\n\\n2. **Pipeline Development**: Their pipeline for DeepSeek-R1 includes two RL stages and integrated SFT (Supervised Filtering Techniques) phases, which are crucial for its reasoning capabilities.\\n\\n3. **Fine-Tuning Strategy**: The model starts with fine-tuning V3-Basely before transitioning to R1-Zero, using a multi-stage approach to improve performance.\\n\\n4. **Task Performance Trends**: Reasoning-focused models outperform others in coding algorithm tasks, where they dominate, indicating the potential for broader improvements in various domains.\\n\\n5. **Future Considerations**: The documents outline steps toward addressing software engineering bottlenecks by exploring new data strategies and evaluation methods.\""
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 4
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## 带有检索的 QA",
   "id": "a0478e3cba122e93"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "最后，我们将检索到的相似文本段落与完整文档合并为统一的上下文（context）。随后，将用户提问（question）与这个上下文结合，按照 RAG_TEMPLATE 的格式整合，最终输入到模型中进行问答处理。",
   "id": "19f1b3d4d5775ddb"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-01-31T11:07:15.358637Z",
     "start_time": "2025-01-31T11:07:12.892376Z"
    }
   },
   "cell_type": "code",
   "source": [
    "from langchain_core.runnables import RunnablePassthrough\n",
    "\n",
    "RAG_TEMPLATE = \"\"\"\n",
    "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\n",
    "\n",
    "<context>\n",
    "{context}\n",
    "</context>\n",
    "\n",
    "Answer the following question:\n",
    "\n",
    "{question}\"\"\"\n",
    "\n",
    "rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)\n",
    "\n",
    "retriever = vectorstore.as_retriever()\n",
    "\n",
    "qa_chain = (\n",
    "    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
    "    | rag_prompt\n",
    "    | model\n",
    "    | StrOutputParser()\n",
    ")\n",
    "\n",
    "question = \"What is the purpose of the DeepSeek project?\"\n",
    "\n",
    "# Run\n",
    "qa_chain.invoke(question)"
   ],
   "id": "e07618c65d72f3bf",
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'<think>\\nOkay, I need to figure out the answer to the question: \"What is the purpose of the DeepSeek project?\" \\n\\nFirst, I\\'ll look through the provided context. It seems like the context is about a research paper or something related to DeepSeek-R1 and its improvements over other models.\\n\\nI see mentions of DeepSeek-R1 having made significant advancements in various fields like engineering tasks, mathematics, coding algorithms, and even software engineering benchmarks. The mention of \"deep learning\" comes up multiple times, which makes me think that the project is focused on deep learning applications or specific areas within it.\\n\\nThere are references to openAI-o1-1217, OpenAI\\'s model, and other models like DeepSeek-R1 Zero. This suggests that DeepSeek-R1 might be an improvement over existing models, possibly addressing certain limitations in their performance across different domains.\\n\\nI also see mentions ofRL (reinforcement learning) stages within the pipeline for training the model. The context says that on math tasks, DeepSeek-R1 outperforms OpenAI-o1-1217 by a large margin, which indicates that it\\'s excelling in specific task areas but maybe doesn\\'t cover all deep learning applications comprehensively.\\n\\nPutting this together, the project is likely focused on improving existing models, particularly in areas like mathematics and software engineering. It might be introducing new techniques or strategies to enhance performance beyond what was previously achieved by other models.\\n</think>\\n\\nThe purpose of the DeepSeek project appears to be focused on enhancing and improving existing deep learning models, particularly excelling in mathematics and software engineering tasks. The project leverages reinforcement learning (RL) to address specific limitations and improve performance across various domains.'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 5
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## 总结",
   "id": "d60e415857e2f105"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "这篇文章展示了如何使用 DeepSeek R1 和 Ollama 实现本地 RAG 应用。我们首先加载并分割 PDF 文档，然后初始化向量存储，接着构建 Chain 表达式，最后进行带有检索的 QA。\n",
    "\n",
    "可以扩展这个例子，使用不同的文档加载器、文本嵌入模型、向量存储、模型和输出解析器，以适应不同的应用场景。"
   ],
   "id": "aa276a3c3fb983e2"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
